clickhouse: prevent replicated tables from starting in read-only mode. #9183
Conversation
Force-pushed from d84e49b to 11b96f2.
<!-- Disable sparse column serialization, which we expect to not need -->
<ratio_of_defaults_for_sparse_serialization>1.0</ratio_of_defaults_for_sparse_serialization>
<!-- Prevent ClickHouse from setting distributed tables to read-only. -->
Doc nit: this is setting the local replica to read-only. I.e., it avoids updating the local state with the shared state in the Raft cluster.
Looks good. As far as testing, I think we should figure out how to check whether this is being used on the Dogfood rack. That is, can we see in the logs that it would have set the table to read-only mode, but did not because of this setting?
Dogfood is the only place we run the replicated cluster.
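As a side note (not part of this PR): one rough way to check the current state on the deployed cluster is to ask ClickHouse's `system.replicas` system table which replicas are read-only right now. This only shows the current state, not the "would have gone read-only but didn't" case, which needs the log lines discussed further down.

```sql
-- List any replicated tables whose local replica is currently read-only.
SELECT database, table, is_readonly, zookeeper_exception
FROM system.replicas
WHERE is_readonly = 1;
```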
Just a nit here: it's so that ClickHouse always accepts the shared state, not the local state.
On start, ClickHouse compares the local state of each distributed table to its distributed state. If it finds a discrepancy, it starts the table in read-only mode. When this happens, oximeter can't write new records to the relevant table(s). In the past, we've worked around this by manually instructing ClickHouse using the `force_restore_data` sentinel file, but this requires manual detection and intervention each time a table starts up in read-only mode. This patch sets the `replicated_max_ratio_of_wrong_parts` flag to 1.0 so that ClickHouse always accepts shared state, and never starts tables in read-only mode.

As described in ClickHouse/ClickHouse#66527, this appears to be a bug, or at least an ergonomic flaw, in ClickHouse. One replica of a table can routinely fall behind the others, e.g. due to restart or network partition, and shouldn't require manual intervention to start back up.

Part of #8595.
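For illustration only, here is a rough sketch of how the two settings might sit together in the replicated server config. The `<clickhouse>`/`<merge_tree>` nesting and exact placement are assumptions, not taken from this diff; the `replicated_max_ratio_of_wrong_parts` value is the 1.0 described above.

```xml
<!-- Sketch only: element nesting and file layout are assumed, not from the PR. -->
<clickhouse>
  <merge_tree>
    <!-- Disable sparse column serialization, which we expect to not need -->
    <ratio_of_defaults_for_sparse_serialization>1.0</ratio_of_defaults_for_sparse_serialization>
    <!-- Tolerate any fraction of mismatched local parts at startup, so the
         replica accepts the shared state instead of going read-only. -->
    <replicated_max_ratio_of_wrong_parts>1.0</replicated_max_ratio_of_wrong_parts>
  </merge_tree>
</clickhouse>
```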
Force-pushed from 11b96f2 to 3b34575.
I think we can find some interesting log lines at https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/StorageReplicatedMergeTree.cpp#L1959-L1984. When we find mismatched parts and switch a table to read-only mode, ClickHouse throws an exception; when we don't throw an exception, it logs a warning instead.

After rolling out this change, I think we should expect to see the latter (warning) log line in the ClickHouse logs. Anyhow, I'm going to merge this PR, and we'll see what happens after the next deploy.
On start, ClickHouse compares the local state of each distributed table to its distributed state. If it finds a discrepancy, it starts the table in read-only mode. When this happens, oximeter can't write new records to the relevant table(s). In the past, we've worked around this by manually instructing ClickHouse using the `force_restore_data` sentinel file, but this requires manual detection and intervention each time a table starts up in read-only mode. This patch sets the `replicated_max_ratio_of_wrong_parts` flag to 1.0 so that ClickHouse always accepts local state, and never starts tables in read-only mode.

As described in ClickHouse/ClickHouse#66527, this appears to be a bug, or at least an ergonomic flaw, in ClickHouse. One replica of a table can routinely fall behind the others, e.g. due to restart or network partition, and shouldn't require manual intervention to start back up.

Part of #8595.
Note: I'm not sure how best to test this. It sounds like we have reasonably high confidence that the fix will work, so we could just merge and deploy to dogfood, and revert if necessary. Or is the ClickHouse cluster running on another rack that we could test on?
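For context, the manual workaround mentioned in the description is ClickHouse's documented `force_restore_data` sentinel file. A rough sketch of that procedure, assuming the stock `/var/lib/clickhouse` data directory (the real path and service management differ on the rack):

```sh
# Drop the sentinel file in the flags directory, then restart the server so it
# restores replicated tables from the shared state in Keeper on startup.
# /var/lib/clickhouse is the stock data directory; adjust for the deployment.
sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data
# ...then restart clickhouse-server however the deployment manages it.
```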